Creators/Authors contains: "Wang, Zhengyang"


  1. Free, publicly-accessible full text available August 10, 2025
  2. Free, publicly-accessible full text available August 4, 2024
  3. Free, publicly-accessible full text available August 4, 2024
  4. Product catalogs, conceptually in the form of text-rich tables, are self-reported by individual retailers and thus inevitably contain noisy facts. Verifying such textual attributes in product catalogs is essential to improve their reliability. However, popular methods for processing free-text content, such as pre-trained language models, are not particularly effective on structured tabular data since they are typically trained on free-form natural language texts. In this paper, we present Tab-Cleaner, a model designed to handle error detection over text-rich tabular data following a pre-training / fine-tuning paradigm. We train Tab-Cleaner on a real-world Amazon Product Catalog table covering millions of products and show improvements over state-of-the-art methods of 16% in PR AUC on the attribute applicability classification task and 11% in PR AUC on the attribute value validation task.
    Free, publicly-accessible full text available July 1, 2024
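The abstract above reports gains in PR AUC (area under the precision-recall curve). As a minimal sketch of how that quantity is commonly estimated, the snippet below computes average precision, one standard estimator of PR AUC. It is an illustration only and not the evaluation code used for Tab-Cleaner; the function name `average_precision` and the toy inputs are assumptions made for this example.

```python
def average_precision(scores, labels):
    """Average precision: the mean of precision values taken at each positive.

    Assumption: labels are 0/1 and a higher score means "more likely positive".
    Tied scores keep their input order because Python's sort is stable.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` predictions at this recall step.
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


if __name__ == "__main__":
    # Hypothetical toy data: two correct attribute values, two noisy ones.
    print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Unlike ROC AUC, this measure ignores true negatives, which is why it is often preferred when errors are rare relative to clean entries.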
  5. Knowledge graph embeddings (KGE) have been extensively studied to embed large-scale relational data for many real-world applications. Existing methods have long ignored the fact that many KGs contain two fundamentally different views: high-level ontology-view concepts and fine-grained instance-view entities. They usually embed all nodes as vectors in one latent space. However, a single geometric representation fails to capture the structural differences between the two views and lacks probabilistic semantics for concepts’ granularity. We propose Concept2Box, a novel approach that jointly embeds the two views of a KG using dual geometric representations. We model concepts with box embeddings, which learn the hierarchy structure and complex relations among them, such as overlap and disjointness. Box volumes can be interpreted as concepts’ granularity. In contrast to concepts, we model entities as vectors. To bridge the gap between concept box embeddings and entity vector embeddings, we propose a novel vector-to-box distance metric and learn both embeddings jointly. Experiments on both the public DBpedia KG and a newly created industrial KG showed the effectiveness of Concept2Box.
    Free, publicly-accessible full text available July 1, 2024
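To make the geometry in the abstract above concrete, here is a minimal sketch of an axis-aligned box, its log-volume, and one simple way to measure how far a vector lies from a box. The distance shown is the plain Euclidean distance to the nearest point of the box, chosen for illustration; it is an assumption of this example and not the vector-to-box metric proposed in Concept2Box, whose definition the abstract does not give. The function names are hypothetical.

```python
import math


def box_log_volume(lower, upper):
    """Log-volume of an axis-aligned box given its lower and upper corners.

    A larger volume can be read as a coarser (more general) concept.
    Assumption: upper[i] > lower[i] in every dimension.
    """
    return sum(math.log(u - l) for l, u in zip(lower, upper))


def point_to_box_distance(point, lower, upper):
    """Euclidean distance from a point to the nearest point of the box.

    The result is 0.0 when the point lies inside or on the box.
    """
    # Per-dimension gap: how far the coordinate falls outside [lower, upper].
    gaps = [max(l - x, 0.0, x - u) for x, l, u in zip(point, lower, upper)]
    return math.sqrt(sum(g * g for g in gaps))


if __name__ == "__main__":
    # Hypothetical 2-D concept box and two entity vectors.
    lower, upper = [0.0, 0.0], [1.0, 1.0]
    print(point_to_box_distance([0.5, 0.5], lower, upper))  # inside the box
    print(point_to_box_distance([4.0, 5.0], lower, upper))  # outside the box
    print(box_log_volume(lower, upper))
```

A distance of zero for interior points gives a natural reading of membership: an entity vector inside a concept box is an instance of that concept.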
  6. Abstract

    Motivation

    Properties of molecules are indicative of their functions and thus are useful in many applications. With the advances of deep-learning methods, computational approaches for predicting molecular properties are gaining increasing momentum. However, customized, advanced methods and comprehensive tools for this task are currently lacking.

    Results

    Here, we develop a suite of comprehensive machine-learning methods and tools spanning different computational models, molecular representations and loss functions for molecular property prediction and drug discovery. Specifically, we represent molecules as both graphs and sequences. Built on these representations, we develop novel deep models for learning from molecular graphs and sequences. In order to learn effectively from highly imbalanced datasets, we develop advanced loss functions that optimize areas under precision–recall curves (PRCs) and receiver operating characteristic (ROC) curves. Altogether, our work not only serves as a comprehensive tool, but also contributes toward developing novel and advanced graph and sequence-learning methodologies. Results on both online and offline antibiotics discovery and molecular property prediction tasks show that our methods achieve consistent improvements over prior methods. In particular, our methods achieve #1 ranking in terms of both ROC-AUC (area under curve) and PRC-AUC on the AI Cures open challenge for drug discovery related to COVID-19.

    Availability and implementation

    Our source code is released as part of the MoleculeX library (https://github.com/divelab/MoleculeX) under AdvProp.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
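The abstract above mentions loss functions that optimize areas under ROC and precision-recall curves on imbalanced data. As a rough illustration of the general idea, the sketch below computes ROC AUC as the fraction of correctly ordered positive-negative pairs and pairs it with a pairwise squared-hinge surrogate that can be minimized in its place. This is a generic textbook surrogate assumed for this example, not the loss functions implemented in MoleculeX; the function names and the `margin` parameter are hypothetical.

```python
def _split_by_label(scores, labels):
    """Separate scores into positives and negatives (labels are 0/1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    return pos, neg


def roc_auc(scores, labels):
    """ROC AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos, neg = _split_by_label(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pairwise_auc_loss(scores, labels, margin=1.0):
    """Squared-hinge surrogate for 1 - AUC, averaged over all pos-neg pairs.

    Each pair is penalized unless the positive outscores the negative
    by at least `margin`.
    """
    pos, neg = _split_by_label(scores, labels)
    total = sum(max(0.0, margin - (p - n)) ** 2 for p in pos for n in neg)
    return total / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy scores for two active and two inactive molecules.
    scores, labels = [0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]
    print(roc_auc(scores, labels))
    print(pairwise_auc_loss(scores, labels))
```

Because the surrogate averages over pairs rather than over individual examples, each positive contributes as much as the negatives it is compared against, which is the property that makes pairwise objectives attractive for highly imbalanced datasets.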
  7. null (Ed.)
  8. Abstract

    The Hengduan Mountains region is a biodiversity hotspot known for its topologically complex, deep valleys and high mountains. While landscape and glacial refugia have been evoked to explain patterns of interspecies divergence, the accumulation of intra‐species (i.e., population level) genetic divergence across the mountain‐valley landscape in this region has received less attention. We used genome‐wide restriction site‐associated DNA sequencing (RADseq) to reveal signatures of Pleistocene glaciation in populations of Thitarodes shambalaensis (Lepidoptera: Hepialidae), the host moth of parasitic Ophiocordyceps sinensis (Hypocreales: Ophiocordycipitaceae) or “caterpillar fungus” endemic to the glacier of eastern Mt. Gongga. We used moraine history along the glacier valleys to model the distribution and environmental barriers to gene flow across populations of T. shambalaensis. We found that moth populations separated by less than 10 km exhibited valley‐based population genetic clustering and isolation‐by‐distance (IBD), while gene flow among populations was best explained by models using information about their distributions at the local last glacial maximum (LGML, 58 kya), not their contemporary distribution. Maximum likelihood lineage history among populations, and among subpopulations as little as 500 m apart, recapitulated glaciation history across the landscape. We also found signals of isolated population expansion following the retreat of LGML glaciers. These results reveal the fine‐scale, long‐term historical influence of landscape and glaciation on the genetic structuring of populations of an endangered and economically important insect species. Similar mechanisms, given enough time and continued isolation, could explain the contribution of glacier refugia to the generation of species diversity among the Hengduan Mountains.

     